Summary/Abstract Write a summary of your project.
Outbreaks of infectious disease are an important, yet poorly understood, driving force in population biology and, while prevalent in both terrestrial and marine systems, the manner by which outbreaks influence biotic relationships, age demographics, community structure and function, and trophic interactions in the aquatic realm consistently lags behind that of terrestrial disease ecology. The growing body of evidence in this field, however, supports the observation that marine disease epidemics are increasing in frequency and severity (Orth et al., 2006; Waycott et al., 2009). Elucidating the relationship between aquatic species and their pathogenic diseases is pertinent in the field of marine ecology because infection outbreaks have the potential to drastically alter ecosystem functionality (Blakesley et al., 2002; Burge et al., 2013). Many of the organisms which have faced chronic and/or severe outbreaks of disease (such as sea urchins, scleractinian corals and seagrasses) are also considered keystone species and/or ecosystem engineers, meaning that they contribute highly valuable services or functions to their surrounding community and ecosystem. Thus, it comes as no surprise that disease-driven mass mortalities of these species often generate waves of ecological permutation, ranging from temporary local disruptions to permanent phase shifts (Burge et al., 2013), such as the transformation from a coral to macroalgal dominated Caribbean reef structure that accompanied the massive Diadema antillarum (black sea urchin) die-off of the early 1980s (Lessios et al., 1984; Lessios, 1988). This event, likely caused by a biological pathogen (Schultz et al., 2016), reduced Diadema populations to less than 1% of their original size (Lessios, 1988). Mass mortality events impacting critical foundation species, ecosystem engineers, or keystone species such as the black sea urchin have been coined ‘marine disease emergencies’ due to the detrimental cascade of events that often succeeds them (Miner et al., 2018).
Sea Star Wasting Disease (SSWD) is another epizootic crisis facing modern day coastal ecosystems. Outbreaks of asteroid wasting have been documented periodically since the late 1970s and SSWD describes a suite of symptoms observed across a broad range of sea star species, most of which play integral parts in shaping their community structure. Generally speaking, the wasting disease events which punctuated the past four decades were relatively brief, in localized areas, and largely failed to capture the attention of the scientific community. Beginning in summer 2013, however, mass mortalities of sea stars due to wasting disease have caused unprecedented damage, owing to the geographical and temporal extent of impact. This ongoing epizootic, which has killed millions of asteroids across over 20 taxa, is widely referred to as the largest disease event sweeping through a wildlife marine species in documented history (Hewson et al. 2014, Gudenkauf & Hewson, 2015, Eisenlord et al. 2016).
Describe what the data is, what it contains, where it is from, etc.
The vast majority of disease documentation and surveys cover the west coast of the United States. While it is almost certainly true that SSWD is most prevalent in this region, there is a strong bias towards the discovery of diseased organisms owing to the emphasis of long-term survey networks in the area, such as the extensive Multi-Agency Rocky Intertidal Network (MARINe for short). The vast majority of what we know about SSWD dynamics comes from studies along the Pacific Coast of North America using data produced through the combined effort of dedicated scientists, government agencies, and local citizens. The purpose of this investigation is to make use of this data monitoring system to investigate spatial and temoporal changes in sea star abundance along the west coast of the United States.
Below I explore the raw data I downloaded from a link I recieved after submitting a request for the data on the MARINe webpage.
library(readr)
raw_SS_count <- read.csv("~/Documents/GitHub/PaigeDuffin-Project/data/raw_data/seastarkat_size_count_totals_download.csv")
head(raw_SS_count, 3)
## group_code site_code marine_site_name marine_sort_order latitude
## 1 UCLA ALEG Alegria 6420 34.46714
## 2 UCLA ALEG Alegria 6420 34.46714
## 3 UCLA ALEG Alegria 6420 34.46714
## longitude marine_common_season marine_season_code marine_common_year
## 1 -120.2774 85 SP02 2002
## 2 -120.2774 85 SP02 2002
## 3 -120.2774 85 SP02 2002
## season_sequence season_name target_assemblage method_code species_code
## 1 1 Spring sea_star IP PISOCH
## 2 1 Spring sea_star IP PISOCH
## 3 1 Spring sea_star IP PISOCH
## size_sort_order size_bin total mpa_designation mpa_region georegion
## 1 3 20 1 reference South Coast CA South
## 2 4 30 3 reference South Coast CA South
## 3 5 40 16 reference South Coast CA South
## bioregion state_province island last_updated
## 1 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 2 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 3 Government Point to Mexico California Mainland 2019-04-05 20:07:59
levels(raw_SS_count$georegion)
## [1] "AK" "CA Central"
## [3] "CA Channel Islands South" "CA North"
## [5] "CA North Central" "CA South"
## [7] "OR" "WA Olympic Coast"
## [9] "WA Salish Sea"
levels(raw_SS_count$species_code)
## [1] "EVATRO" "KATTUN" "PISOCH"
Notably, it contains the following information (some parameters removed from this overview):
groupcode: The unique code for each monitoring groupmarine_site_name: The name of the site where the survey was conducted.latitude and longitude: (self explanatory)marine_season_code: A four-character code to identify the Sampling season. The first two characters indicate the season and the last 2 characters indicate the year.method_code: The method used for sampling the plot.size_bin: The size of the species being counted, binned to the nearest 5 or 10 millimeter.target_assemblage: The specific organism that this transect was created to monitor.total: Total number of individuals counted in a given size_binmpa_designation: Describes whether the referenced site is located within a Marine Protected Area (MPA) or is a reference site. If this field is blank, the referenced site is not located within an MPA, or is not a reference site for an existing MPA site.georegion/ bioregion: geographic/biogeographic region in which site is locatedstate_province: The State or Province and Country where the referenced site is located.island: The name of the island where the referenced site is located. Sites not on islands are designated as mainland.I believe I have more data coming in to work with. I recieved an email today from a representative of the Multi-Agency Rocky Intertidal Network. This email said:
“You should have been able to download our sea star size/count data from the data download page that you were directed to after filling out the data request form. We also have associated disease category data, but because we only sample 1x per year at most sites, this additional info will not give you an idea of disease emergence or progression within a site or region. You might also be interested in our observation data set, which would be better for these types of questions. We can have this data to you within about a week.”
I’m also interested in environmental data such as sea surface temperature, dissolved oxygen, salinity, etc., as SSWD is thought to be exacerbated by abiotic stressors. However, I’m having a difficult time finding data to use that will pair reasonably well with the dates and locations of the raw_SS_count data I know I will be working with.
I’m very interested in this disease and environmental data, but don’t know what it will look like or if it will be feasible to use. Hopefully, I can incorporate this into my analysis, but for right now I am going to focus on outlining the questions I can answer with the data I have in hand.
State the research questions you plan to answer with this analysis
1. Are sea star populations changing over time? I anticipate that there will be a decline in sea star populations over time, but that the change will not occur uniformly across geographical regions and species. Species that are reported to have been affected by SSWD will experience the biggest declines.
2. Is species composition uniform across space? I expect to be able to define rough species ranges.
3. Is there evidence of sampling bias? Is there a relationship between important survey parameters (mainly, abundance data) and sampling method (ex. group conducting the survey, plot sampling method, the target assemblage). At the very least, I expect there will be bias based on the targeted organism of the survey; if a given survey aims to characterize the presence of one species of sea stars, it is likely that they will seek out that species and record higher counts.
1. Is there a relationship between sea star abundance over time and prevalence of SSWD? There absolutely should be a relationship, based on the body of evidence presented on studies that have focused on individual species.
2. Are there species of sea stars that are more affected by disease than others? The literature implies that this may be the case, but there also may be a bias based on species that researchers are more interested in.
3. Are there regions that are more affected by disease than others, regardless of species? The most recent epidemic appears to be so widespread that I suspect there won’t be any significant trends here, but it’s certainly worth checking.
4. Is there a relationship between sea star wasting disease and dissolved oxygen levels? A recent, but leading hypothesis in the field of SSWD suggests that it may be triggered specifically by low oxygen conditions. I’m very interested in exploring relationships between SSWD and DO levels- this is actually the main curiosity that lead me to focus in on this data set.
5. Is there a relationship between sea star wasting disease and [other water quality parameters I’m able to find data for]?
I have spent a very long time trying to figure out how to get all of the different components of the .Rproj to “talk” to each other (ie, write script in “processingscript.R”, have it save the processed script to the ./data/processed_data/ folder as an “RDS”) and I can’t figure it out. I looked at the example provided by Dr. Handel/his student in the second module, as well as some of the other students who have submitted already publically on the GitHub page.
In the interest of having something to turn in for this assignment on time, I am going to make this R Markdown file on my progress and turn it into a website, as this is something I do know how to do. I don’t want to complain, but I have to admit I am feeling very frusterated, overwhelmed, and defeated right now :(
Throughout this document, I’ve highlighted places where I’ve had an especially difficult time in font color=“red”>red font.
font color=“red”>Generally speaking, I do not know how to use the bibtex reference approach. I may be able to figure it out on my own, but truly am worried about how much of a time commitment this is all going to be. Are there any resources you can point me towards so I can learn how to use this system?
This is what I would include in my processingscript.R file.
processing script
this script loads the raw data, processes and cleans it and saves it as Rds file in the processed_data folder
load needed packages. make sure they are installed.
library(caret) # Classification and Regression Training
## Warning: package 'caret' was built under R version 3.5.2
## Warning: package 'ggplot2' was built under R version 3.5.2
library(corrplot)
library(dplyr) # A Grammar of Data Manipulations
## Warning: package 'dplyr' was built under R version 3.5.2
library(forcats) # Tools for Working with Categorical Variables (Factors)
## Warning: package 'forcats' was built under R version 3.5.2
library(gdata) # Various R Programming Tools for Data Manipulation
library(ggplot2) # Create Elegant Data Visualizations Using the Grammar of Graphics
library(ggrepel) # Automatically Position Non-Overlapping Text Labels with ‘ggplot2’
## Warning: package 'ggrepel' was built under R version 3.5.2
library(ggridges) # Ridgeline Plots in ‘ggplot2’
library(ggthemes) # Extra Themes, Scales, and Geoms for ‘ggplot2’
## Warning: package 'ggthemes' was built under R version 3.5.2
library(gmodels) # Various R Programming Tools for Model Fitting
library(grid) # The Grid Graphics Package
library(gridExtra)
library(knitr) # A General-Purpose Package for Dynamic Report Generation in R
## Warning: package 'knitr' was built under R version 3.5.2
library(lmtest) # Testing Linear Regression Models
## Warning: package 'lmtest' was built under R version 3.5.2
library(plyr) # Tools for Splitting, Applying and Combining Data
library(purrr) # Functional Programming Tools
## Warning: package 'purrr' was built under R version 3.5.2
library(readr) # Read Rectangular Text Data
library(rpart)
## Warning: package 'rpart' was built under R version 3.5.2
library(rattle)
library(scales) # Scale Functions for Visualization
library(shiny) # Web Application Framework for R
## Warning: package 'shiny' was built under R version 3.5.2
library(skimr) # Compact and Flexible Summaries of Data
## Warning: package 'skimr' was built under R version 3.5.2
library(stringr) # Simple, Consistent Wrappers for Common String Operations
library(tibble) # Simple Data Frames
## Warning: package 'tibble' was built under R version 3.5.2
library(tidyr) # Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions
## Warning: package 'tidyr' was built under R version 3.5.2
library(tidyverse) # Install and Load ‘Tidyverse’ Packages
library(usmap)
## Warning: package 'usmap' was built under R version 3.5.2
library(rworldmap)
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.5.2
library(maps)
library(mapdata)
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.5.2
library("googleway")
library("ggspatial")
library("rnaturalearth")
library("rnaturalearthdata")
theme_set(theme_bw())
library("sf")
## Warning: package 'sf' was built under R version 3.5.2
register_google(key = "AIzaSyB3n0SkJTXxrbUkF2bkBJs7XdleT_iJXsg", write = TRUE)
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.2
library(dplyr)
library(readr)
load data. path is relative to project directory.
# raw_SS_count <- read.csv("./data/raw_data/seastarkat_size_count_totals_download.csv") # already did that, above
head(raw_SS_count, 3)
## group_code site_code marine_site_name marine_sort_order latitude
## 1 UCLA ALEG Alegria 6420 34.46714
## 2 UCLA ALEG Alegria 6420 34.46714
## 3 UCLA ALEG Alegria 6420 34.46714
## longitude marine_common_season marine_season_code marine_common_year
## 1 -120.2774 85 SP02 2002
## 2 -120.2774 85 SP02 2002
## 3 -120.2774 85 SP02 2002
## season_sequence season_name target_assemblage method_code species_code
## 1 1 Spring sea_star IP PISOCH
## 2 1 Spring sea_star IP PISOCH
## 3 1 Spring sea_star IP PISOCH
## size_sort_order size_bin total mpa_designation mpa_region georegion
## 1 3 20 1 reference South Coast CA South
## 2 4 30 3 reference South Coast CA South
## 3 5 40 16 reference South Coast CA South
## bioregion state_province island last_updated
## 1 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 2 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 3 Government Point to Mexico California Mainland 2019-04-05 20:07:59
levels(raw_SS_count$georegion)
## [1] "AK" "CA Central"
## [3] "CA Channel Islands South" "CA North"
## [5] "CA North Central" "CA South"
## [7] "OR" "WA Olympic Coast"
## [9] "WA Salish Sea"
levels(raw_SS_count$species_code)
## [1] "EVATRO" "KATTUN" "PISOCH"
take a look at the data
dplyr::glimpse(raw_SS_count)
## Observations: 13,165
## Variables: 24
## $ group_code <fct> UCLA, UCLA, UCLA, UCLA, UCLA, UCLA, UCLA, U…
## $ site_code <fct> ALEG, ALEG, ALEG, ALEG, ALEG, ALEG, ALEG, A…
## $ marine_site_name <fct> Alegria, Alegria, Alegria, Alegria, Alegria…
## $ marine_sort_order <int> 6420, 6420, 6420, 6420, 6420, 6420, 6420, 6…
## $ latitude <dbl> 34.46714, 34.46714, 34.46714, 34.46714, 34.…
## $ longitude <dbl> -120.2774, -120.2774, -120.2774, -120.2774,…
## $ marine_common_season <int> 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85,…
## $ marine_season_code <fct> SP02, SP02, SP02, SP02, SP02, SP02, SP02, S…
## $ marine_common_year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
## $ season_sequence <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ season_name <fct> Spring, Spring, Spring, Spring, Spring, Spr…
## $ target_assemblage <fct> sea_star, sea_star, sea_star, sea_star, sea…
## $ method_code <fct> IP, IP, IP, IP, IP, IP, IP, IP, IP, IP, IP,…
## $ species_code <fct> PISOCH, PISOCH, PISOCH, PISOCH, PISOCH, PIS…
## $ size_sort_order <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16…
## $ size_bin <int> 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 1…
## $ total <int> 1, 3, 16, 25, 27, 22, 11, 2, 2, 1, 1, 3, 6,…
## $ mpa_designation <fct> reference, reference, reference, reference,…
## $ mpa_region <fct> South Coast, South Coast, South Coast, Sout…
## $ georegion <fct> CA South, CA South, CA South, CA South, CA …
## $ bioregion <fct> Government Point to Mexico, Government Poin…
## $ state_province <fct> California, California, California, Califor…
## $ island <fct> Mainland, Mainland, Mainland, Mainland, Mai…
## $ last_updated <fct> 2019-04-05 20:07:59, 2019-04-05 20:07:59, 2…
head(raw_SS_count, 3)
## group_code site_code marine_site_name marine_sort_order latitude
## 1 UCLA ALEG Alegria 6420 34.46714
## 2 UCLA ALEG Alegria 6420 34.46714
## 3 UCLA ALEG Alegria 6420 34.46714
## longitude marine_common_season marine_season_code marine_common_year
## 1 -120.2774 85 SP02 2002
## 2 -120.2774 85 SP02 2002
## 3 -120.2774 85 SP02 2002
## season_sequence season_name target_assemblage method_code species_code
## 1 1 Spring sea_star IP PISOCH
## 2 1 Spring sea_star IP PISOCH
## 3 1 Spring sea_star IP PISOCH
## size_sort_order size_bin total mpa_designation mpa_region georegion
## 1 3 20 1 reference South Coast CA South
## 2 4 30 3 reference South Coast CA South
## 3 5 40 16 reference South Coast CA South
## bioregion state_province island last_updated
## 1 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 2 Government Point to Mexico California Mainland 2019-04-05 20:07:59
## 3 Government Point to Mexico California Mainland 2019-04-05 20:07:59
levels(raw_SS_count$georegion)
## [1] "AK" "CA Central"
## [3] "CA Channel Islands South" "CA North"
## [5] "CA North Central" "CA South"
## [7] "OR" "WA Olympic Coast"
## [9] "WA Salish Sea"
levels(raw_SS_count$species_code)
## [1] "EVATRO" "KATTUN" "PISOCH"
change many of the variables from factors to characters, at least for now
SS_count <- raw_SS_count
SS_count$group_code <- as.character(SS_count$group_code)
SS_count$site_code <- as.character(SS_count$site_code)
SS_count$marine_site_name <- as.character(SS_count$marine_site_name)
SS_count$marine_season_code <- as.character(SS_count$marine_season_code)
SS_count$season_name <- as.character(SS_count$season_name)
SS_count$target_assemblage <- as.character(SS_count$target_assemblage)
SS_count$method_code <- as.character(SS_count$method_code)
SS_count$species_code <- as.character(SS_count$species_code)
SS_count$mpa_designation <- as.character(SS_count$mpa_designation)
SS_count$mpa_region <- as.character(SS_count$mpa_region)
SS_count$georegion <- as.character(SS_count$georegion)
SS_count$bioregion <- as.character(SS_count$bioregion)
SS_count$state_province <- as.character(SS_count$state_province)
SS_count$island <- as.character(SS_count$island)
SS_count$last_updated <- as.character(SS_count$last_updated)
glimpse(SS_count)
## Observations: 13,165
## Variables: 24
## $ group_code <chr> "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UC…
## $ site_code <chr> "ALEG", "ALEG", "ALEG", "ALEG", "ALEG", "AL…
## $ marine_site_name <chr> "Alegria", "Alegria", "Alegria", "Alegria",…
## $ marine_sort_order <int> 6420, 6420, 6420, 6420, 6420, 6420, 6420, 6…
## $ latitude <dbl> 34.46714, 34.46714, 34.46714, 34.46714, 34.…
## $ longitude <dbl> -120.2774, -120.2774, -120.2774, -120.2774,…
## $ marine_common_season <int> 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85,…
## $ marine_season_code <chr> "SP02", "SP02", "SP02", "SP02", "SP02", "SP…
## $ marine_common_year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
## $ season_sequence <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ season_name <chr> "Spring", "Spring", "Spring", "Spring", "Sp…
## $ target_assemblage <chr> "sea_star", "sea_star", "sea_star", "sea_st…
## $ method_code <chr> "IP", "IP", "IP", "IP", "IP", "IP", "IP", "…
## $ species_code <chr> "PISOCH", "PISOCH", "PISOCH", "PISOCH", "PI…
## $ size_sort_order <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16…
## $ size_bin <int> 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 1…
## $ total <int> 1, 3, 16, 25, 27, 22, 11, 2, 2, 1, 1, 3, 6,…
## $ mpa_designation <chr> "reference", "reference", "reference", "ref…
## $ mpa_region <chr> "South Coast", "South Coast", "South Coast"…
## $ georegion <chr> "CA South", "CA South", "CA South", "CA Sou…
## $ bioregion <chr> "Government Point to Mexico", "Government P…
## $ state_province <chr> "California", "California", "California", "…
## $ island <chr> "Mainland", "Mainland", "Mainland", "Mainla…
## $ last_updated <chr> "2019-04-05 20:07:59", "2019-04-05 20:07:59…
skim(SS_count)
## Skim summary statistics
## n obs: 13165
## n variables: 24
##
## ── Variable type:character ─────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## bioregion 0 13165 13165 13 33 0 6
## georegion 0 13165 13165 2 24 0 9
## group_code 0 13165 13165 2 6 0 9
## island 0 13165 13165 3 9 0 7
## last_updated 0 13165 13165 19 19 0 1
## marine_season_code 0 13165 13165 4 4 0 57
## marine_site_name 0 13165 13165 5 42 0 77
## method_code 0 13165 13165 2 4 0 4
## mpa_designation 0 13165 13165 2 9 0 5
## mpa_region 0 13165 13165 2 19 0 6
## season_name 0 13165 13165 4 6 0 4
## site_code 0 13165 13165 3 4 0 77
## species_code 0 13165 13165 6 6 0 3
## state_province 0 13165 13165 6 10 0 4
## target_assemblage 0 13165 13165 8 8 0 1
##
## ── Variable type:integer ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## marine_common_season 0 13165 13165 116.82 19.23 78 101
## marine_common_year 0 13165 13165 2009.7 4.81 2000 2006
## marine_sort_order 0 13165 13165 5566.24 1125.15 1720 5020
## season_sequence 0 13165 13165 2.03 0.81 1 1
## size_bin 0 13165 13165 82.44 42.2 5 50
## size_sort_order 0 13165 13165 9.57 4.15 1 6
## total 0 13165 13165 9.92 16.43 1 2
## p50 p75 p100 hist
## 118 131 152 ▅▅▆▇▇▇▇▅
## 2010 2013 2018 ▃▅▅▇▇▆▆▆
## 6090 6370 7650 ▁▁▁▂▃▃▇▁
## 2 3 4 ▇▁▇▁▁▇▁▁
## 80 110 320 ▅▇▇▃▁▁▁▁
## 10 12 33 ▅▇▇▃▁▁▁▁
## 4 11 360 ▇▁▁▁▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## latitude 0 13165 13165 38.67 5.2 32.87 34.73 36.62
## longitude 0 13165 13165 -122.25 2.86 -135.38 -123.98 -121.94
## p75 p100 hist
## 41.65 57.05 ▇▆▃▂▁▁▁▁
## -120.62 -117.25 ▁▁▁▁▅▇▆▃
save as RDS file
saveRDS(SS_count, file = “./data/processed_data/”)
font color=“red”>I don’t understand this step. After much trial and error, it continues to not work for me. Can you provide additional instruction on how to do this step? I keep recieving the error message: Error in gzfile(file, mode) : cannot open the connection. In addition: Warning message: In gzfile(file, mode) : cannot open compressed file ‘./data/processed_data/’, probable reason ‘Is a directory’, but when I write “./data/processed_data/processed_SS_count”, it gives me a different error
load necessary packages
library(caret) # Classification and Regression Training
library(corrplot)
library(dplyr) # A Grammar of Data Manipulations
library(forcats) # Tools for Working with Categorical Variables (Factors)
library(gdata) # Various R Programming Tools for Data Manipulation
library(ggplot2) # Create Elegant Data Visualizations Using the Grammar of Graphics
library(ggrepel) # Automatically Position Non-Overlapping Text Labels with ‘ggplot2’
library(ggridges) # Ridgeline Plots in ‘ggplot2’
library(ggthemes) # Extra Themes, Scales, and Geoms for ‘ggplot2’
library(gmodels) # Various R Programming Tools for Model Fitting
library(grid) # The Grid Graphics Package
library(gridExtra)
library(knitr) # A General-Purpose Package for Dynamic Report Generation in R
library(lmtest) # Testing Linear Regression Models
library(plyr) # Tools for Splitting, Applying and Combining Data
library(purrr) # Functional Programming Tools
library(readr) # Read Rectangular Text Data
library(rpart)
library(rattle)
library(scales) # Scale Functions for Visualization
library(shiny) # Web Application Framework for R
library(skimr) # Compact and Flexible Summaries of Data
library(stringr) # Simple, Consistent Wrappers for Common String Operations
library(tibble) # Simple Data Frames
library(tidyr) # Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions
library(tidyverse) # Install and Load ‘Tidyverse’ Packages
library(usmap)
library(rworldmap)
library(ggmap)
library(maps)
library(mapdata)
library(cowplot)
library("googleway")
library("ggspatial")
library("rnaturalearth")
library("rnaturalearthdata")
theme_set(theme_bw())
library("sf")
register_google(key = "AIzaSyB3n0SkJTXxrbUkF2bkBJs7XdleT_iJXsg", write = TRUE)
## Replacing old key (AIzaSyB3n0SkJTXxrbUkF2bkBJs7XdleT_iJXsg) with new key in /Users/Paige/.Renviron
SS_count %>% ggplot() +
geom_histogram(aes(total))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
unique_sites <- subset(raw_SS_count, !duplicated(raw_SS_count$site_code))
sites_coordinates <- unique_sites[c(2,5:6,15,22)]
sites_coordinates_contig <- filter(sites_coordinates, sites_coordinates$state_province != "Alaska")
CA_coordinates <- filter(sites_coordinates, sites_coordinates$state_province == "California")
OR_coordinates <- filter(sites_coordinates, sites_coordinates$state_province == "Oregon")
WA_coordinates <- filter(sites_coordinates, sites_coordinates$state_province == "Washington")
AK_coordinates <- filter(sites_coordinates, sites_coordinates$state_province == "Alaska")
CA_coordinates <- as.data.frame(CA_coordinates)
OR_coordinates <- as.data.frame(OR_coordinates)
WA_coordinates <- as.data.frame(WA_coordinates)
AK_coordinates <- as.data.frame(AK_coordinates)
After messing around with several different mapping tools in R, I decided that trying to display the whole range (CA to AK) on one map would be really hard, and the Alaska sites are really just at one specific location, so I’m going to split Alaska off from the three states that are part of the contiguous US #### Plot AK sites
world <- ne_countries(scale = "medium", returnclass = "sf")
AK_zoom1 <- ggplot(data = world) +
geom_sf() +
geom_point(data= sites_coordinates,aes(x=sites_coordinates$longitude, y=sites_coordinates$latitude,), size = 2, alpha = 0.4) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-170, -128), ylim = c(54, 72), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude") +
scale_x_continuous(breaks = c(-160, -150, -140))
AK_zoom2 <- ggplot(data = world) +
geom_sf() +
geom_point(data= sites_coordinates,aes(x=sites_coordinates$longitude, y=sites_coordinates$latitude,), size = 2, alpha = 0.4) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-144, -130), ylim = c(55, 62), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude") +
scale_x_continuous(breaks = c(-142, -138, -134))
grid.arrange(AK_zoom1, AK_zoom2, ncol = 2)
## Scale on map varies by more than 10%, scale bar may be inaccurate
## Scale on map varies by more than 10%, scale bar may be inaccurate
State <- sites_coordinates_contig$state_province
ggplot(data = world) +
geom_sf() +
geom_jitter(data= sites_coordinates_contig,aes(x=sites_coordinates_contig$longitude, y=sites_coordinates_contig$latitude,col=State), size = 2, alpha = 0.6, width = 0.2) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-128, -115), ylim = c(32, 49.5), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude")
## Scale on map varies by more than 10%, scale bar may be inaccurate
CA <- SS_count %>% select(marine_site_name, marine_common_year, size_bin, total, state_province) %>%
filter(state_province == "California")
summary(CA)
## marine_site_name marine_common_year size_bin total
## Length:10548 Min. :2000 Min. : 5.00 Min. : 1.000
## Class :character 1st Qu.:2005 1st Qu.: 50.00 1st Qu.: 2.000
## Mode :character Median :2009 Median : 80.00 Median : 5.000
## Mean :2009 Mean : 81.92 Mean : 9.607
## 3rd Qu.:2013 3rd Qu.:110.00 3rd Qu.: 11.000
## Max. :2018 Max. :230.00 Max. :360.000
## state_province
## Length:10548
## Class :character
## Mode :character
##
##
##
OR <- SS_count %>% select(marine_site_name, marine_common_year, size_bin, total, state_province) %>%
filter(state_province == "Oregon")
summary(OR)
## marine_site_name marine_common_year size_bin total
## Length:1386 Min. :2000 Min. : 5.00 Min. : 1.00
## Class :character 1st Qu.:2005 1st Qu.: 50.00 1st Qu.: 2.00
## Mode :character Median :2010 Median : 80.00 Median : 7.00
## Mean :2010 Mean : 87.21 Mean : 15.28
## 3rd Qu.:2014 3rd Qu.:120.00 3rd Qu.: 18.00
## Max. :2018 Max. :230.00 Max. :212.00
## state_province
## Length:1386
## Class :character
## Mode :character
##
##
##
WA <- SS_count %>% select(marine_site_name, marine_common_year, size_bin, total, state_province) %>%
filter(state_province == "Washington")
summary(WA)
## marine_site_name marine_common_year size_bin total
## Length:865 Min. :2008 Min. : 5.00 Min. : 1.000
## Class :character 1st Qu.:2011 1st Qu.: 50.00 1st Qu.: 1.000
## Mode :character Median :2014 Median : 90.00 Median : 2.000
## Mean :2014 Mean : 89.26 Mean : 6.776
## 3rd Qu.:2016 3rd Qu.:120.00 3rd Qu.: 5.000
## Max. :2018 Max. :320.00 Max. :133.000
## state_province
## Length:865
## Class :character
## Mode :character
##
##
##
AK <- SS_count %>% select(marine_site_name, marine_common_year, size_bin, total, state_province) %>%
filter(state_province == "Alaska")
summary(AK)
## marine_site_name marine_common_year size_bin total
## Length:366 Min. :2011 Min. : 5.00 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 40.00 1st Qu.: 1.00
## Mode :character Median :2015 Median : 60.00 Median : 3.00
## Mean :2015 Mean : 63.51 Mean : 6.03
## 3rd Qu.:2017 3rd Qu.: 87.50 3rd Qu.: 7.00
## Max. :2018 Max. :190.00 Max. :117.00
## state_province
## Length:366
## Class :character
## Mode :character
##
##
##
[INSERT TABLE SUMMARIZING THIS INFORMATION]
SS_count %>% ggplot(aes(marine_common_year, total)) + geom_jitter(width = 0.15, alpha = 0.25)
SS_count %>% ggplot(aes(marine_common_year, total, col=state_province)) + geom_jitter(width = 0.15, alpha = 0.25)
Woah, California is dominating this visualization.
SS_count %>%
ggplot(aes(marine_common_year, total, col=state_province)) + geom_jitter(width = 0.15, alpha = 0.25) +
facet_wrap(~state_province)
Washington and Alaska clearly weren’t surveyed for all years during which California and Oregon were.
SS_count %>%
ggplot(aes(marine_common_year, log(total), col=state_province)) + geom_jitter(width = 0.15, alpha = 0.25) +
facet_wrap(~state_province)
SS_count %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year)) + geom_boxplot()
PISOCH_only <- filter(SS_count, species_code == "PISOCH")
KATTUN_only <- filter(SS_count, species_code == "KATTUN")
EVATRO_only <- filter(SS_count, species_code == "EVATRO")
skim(PISOCH_only)
## Skim summary statistics
## n obs: 11416
## n variables: 24
##
## ── Variable type:character ─────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## bioregion 0 11416 11416 13 33 0 6
## georegion 0 11416 11416 2 24 0 9
## group_code 0 11416 11416 2 6 0 9
## island 0 11416 11416 3 9 0 7
## last_updated 0 11416 11416 19 19 0 1
## marine_season_code 0 11416 11416 4 4 0 57
## marine_site_name 0 11416 11416 5 42 0 77
## method_code 0 11416 11416 2 4 0 4
## mpa_designation 0 11416 11416 2 9 0 5
## mpa_region 0 11416 11416 2 19 0 6
## season_name 0 11416 11416 4 6 0 4
## site_code 0 11416 11416 3 4 0 77
## species_code 0 11416 11416 6 6 0 1
## state_province 0 11416 11416 6 10 0 4
## target_assemblage 0 11416 11416 8 8 0 1
##
## ── Variable type:integer ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## marine_common_season 0 11416 11416 114.4 19.14 78 99
## marine_common_year 0 11416 11416 2009.09 4.79 2000 2005
## marine_sort_order 0 11416 11416 5682.02 1048.93 1720 5050
## season_sequence 0 11416 11416 2.03 0.84 1 1
## size_bin 0 11416 11416 86.83 42.67 5 50
## size_sort_order 0 11416 11416 9.68 4.27 1 6
## total 0 11416 11416 10.85 17.41 1 2
## p50 p75 p100 hist
## 114 130 152 ▅▅▆▇▆▆▅▃
## 2009 2013 2018 ▃▅▅▇▆▆▃▅
## 6150 6390 7650 ▁▁▁▂▃▂▇▁
## 2 3 4 ▇▁▆▁▁▇▁▁
## 90 120 320 ▅▇▇▅▁▁▁▁
## 10 13 33 ▅▇▇▅▁▁▁▁
## 5 13 360 ▇▁▁▁▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## latitude 0 11416 11416 38.13 4.82 32.87 34.55 36.51
## longitude 0 11416 11416 -121.94 2.57 -135.38 -123.51 -121.91
## p75 p100 hist
## 40.34 57.05 ▇▅▂▂▁▁▁▁
## -120.61 -117.25 ▁▁▁▁▅▇▇▅
skim(KATTUN_only)
## Skim summary statistics
## n obs: 1676
## n variables: 24
##
## ── Variable type:character ─────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## bioregion 0 1676 1676 13 33 0 4
## georegion 0 1676 1676 2 16 0 7
## group_code 0 1676 1676 2 6 0 6
## island 0 1676 1676 3 9 0 5
## last_updated 0 1676 1676 19 19 0 1
## marine_season_code 0 1676 1676 4 4 0 27
## marine_site_name 0 1676 1676 5 42 0 46
## method_code 0 1676 1676 2 2 0 1
## mpa_designation 0 1676 1676 2 9 0 5
## mpa_region 0 1676 1676 2 19 0 5
## season_name 0 1676 1676 4 6 0 3
## site_code 0 1676 1676 3 4 0 46
## species_code 0 1676 1676 6 6 0 1
## state_province 0 1676 1676 6 10 0 4
## target_assemblage 0 1676 1676 8 8 0 1
##
## ── Variable type:integer ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## marine_common_season 0 1676 1676 132.29 10.05 114 123 133
## marine_common_year 0 1676 1676 2013.57 2.54 2009 2011 2014
## marine_sort_order 0 1676 1676 4911.32 1220.64 1720 4400 5050
## season_sequence 0 1676 1676 2.01 0.62 1 2 2
## size_bin 0 1676 1676 53.79 23.85 5 35 50
## size_sort_order 0 1676 1676 8.95 3.05 1 7 9
## total 0 1676 1676 3.9 3.26 1 1 3
## p75 p100 hist
## 141 150 ▆▇▅▆▇▇▅▆
## 2016 2018 ▇▇▅▆▇▇▆▇
## 6130 6310 ▂▁▁▁▂▇▁▆
## 2 3 ▂▁▁▇▁▁▁▂
## 70 130 ▃▃▇▅▇▂▁▁
## 11 17 ▂▂▃▇▇▅▁▁
## 5 19 ▇▂▂▁▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## latitude 0 1676 1676 41.73 5.77 35.45 36.56 40.34
## longitude 0 1676 1676 -124.07 3.33 -135.38 -124.14 -123.73
## p75 p100 hist
## 44.24 57.05 ▇▅▆▂▃▁▁▂
## -121.95 -120.95 ▁▁▁▁▁▁▇▆
skim(EVATRO_only)
## Skim summary statistics
## n obs: 73
## n variables: 24
##
## ── Variable type:character ─────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## bioregion 0 73 73 13 26 0 2
## georegion 0 73 73 2 13 0 2
## group_code 0 73 73 2 6 0 5
## island 0 73 73 3 9 0 6
## last_updated 0 73 73 19 19 0 1
## marine_season_code 0 73 73 4 4 0 10
## marine_site_name 0 73 73 9 42 0 10
## method_code 0 73 73 2 2 0 1
## mpa_designation 0 73 73 4 4 0 1
## mpa_region 0 73 73 4 4 0 1
## season_name 0 73 73 4 6 0 3
## site_code 0 73 73 3 4 0 10
## species_code 0 73 73 6 6 0 1
## state_province 0 73 73 6 10 0 2
## target_assemblage 0 73 73 8 8 0 1
##
## ── Variable type:integer ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## marine_common_season 0 73 73 140.75 9.19 114 134 142
## marine_common_year 0 73 73 2015.7 2.28 2009 2014 2016
## marine_sort_order 0 73 73 2496.03 833.39 1720 1720 1740
## season_sequence 0 73 73 1.96 0.26 1 2 2
## size_bin 0 73 73 54.11 28.41 5 30 50
## size_sort_order 0 73 73 6.4 2.87 1 4 6
## total 0 73 73 1.84 1.83 1 1 1
## p75 p100 hist
## 150 150 ▁▁▁▂▃▁▃▇
## 2018 2018 ▁▁▁▂▃▁▃▇
## 3290 3850 ▇▁▁▁▁▃▃▁
## 2 3 ▁▁▁▇▁▁▁▁
## 70 160 ▅▇▆▅▅▁▁▁
## 8 17 ▅▇▆▅▅▁▁▁
## 2 13 ▇▁▁▁▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## latitude 0 73 73 52.86 4.5 47.58 48.52 56.99 57.05
## longitude 0 73 73 -129.42 6.4 -135.38 -135.35 -135.32 -122.55
## p100 hist
## 57.05 ▇▁▁▁▁▁▁▇
## -122.52 ▇▁▁▁▁▁▁▇
PISOCH_only %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year)) + geom_boxplot()
KATTUN_only %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year)) + geom_boxplot()
EVATRO_only %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year)) + geom_boxplot()
That last one looked like the log() function might not be applicable. Let’s see what it looks like without that:
EVATRO_only %>% ggplot(aes(marine_common_year, total, group = marine_common_year)) + geom_boxplot()
The most recent and catastrophic sea star wasting disease outbreak hit the west coast of North America around 2013. How did sea star abundance look before the epidemic, how did it change during the epidemic, and have the populations rebounded?
last_decade <- filter(SS_count, marine_common_year >= "2010")
last_decade %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year)) + geom_boxplot()
It looks like populations leveled back out at 2016 and remained relatively unchanged in 2017 and 2018.
It looks like theres a detectable decrease between 2013 and 2014, and that populations leveled back out at 2016 and remained relatively unchanged in 2017 and 2018. Lets narrow the time frame to only include sampling between 2013 and 2016.
last_decade$marine_common_year <- as.factor(last_decade$marine_common_year)
SSWD_timeframe <- filter(SS_count, marine_common_year >= "2013" & marine_common_year <= "2016")
SSWD_timeframe %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year, col=marine_common_year)) + geom_boxplot()
SSWD_timeframe %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year, col=marine_common_year)) + geom_boxplot() + facet_wrap(~state_province, 1) + theme(legend.position = "none")
SSWD_timeframe %>% ggplot(aes(marine_common_year, log(total), col=state_province)) + geom_jitter(width = 0.45, alpha = 0.25) +
facet_wrap(~state_province, 1) + geom_smooth(method = "lm", col="black") + theme(legend.position = "none")
SSWD_timeframe_CA <- filter(SSWD_timeframe, state_province == "California")
SSWD_timeframe_CA %>% ggplot(aes(marine_common_year, log(total), col=log(total))) + geom_jitter(width = 0.5, alpha = 0.25) + geom_smooth(method = "lm", col="red")
SSWD_timeframe_PIS <- filter(PISOCH_only, marine_common_year >= "2013" & marine_common_year <= "2016")
SSWD_timeframe_PIS %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year, col=marine_common_year)) + geom_boxplot() + facet_wrap(~state_province, 1) + theme(legend.position = "none")
SSWD_timeframe_KAT <- filter(KATTUN_only, marine_common_year >= "2013" & marine_common_year <= "2016")
SSWD_timeframe_KAT %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year, col=marine_common_year)) + geom_boxplot() + facet_wrap(~state_province, 1) + theme(legend.position = "none")
SSWD_timeframe_EVA <- filter(EVATRO_only, marine_common_year >= "2013" & marine_common_year <= "2016")
SSWD_timeframe_EVA %>% ggplot(aes(marine_common_year, log(total), group = marine_common_year, col=marine_common_year)) + geom_boxplot() + facet_wrap(~state_province, 1) + theme(legend.position = "none")
OH! I didn’t catch that the reason why the EVA species had such few survey sightings was because its range appears to only be in Alaska. Speaking of which….
Species <- SSWD_timeframe$species_code
ggplot(data = world) +
geom_sf() +
geom_jitter(data= SSWD_timeframe,aes(x=SSWD_timeframe$longitude, y=SSWD_timeframe$latitude,col=Species), size = 2, alpha = 0.3, width = 0.2) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-128, -115), ylim = c(32, 49.5), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude")
## Scale on map varies by more than 10%, scale bar may be inaccurate
AK_zoom1_SSWD <- ggplot(data = world) +
geom_sf() +
geom_point(data= SSWD_timeframe,aes(x=SSWD_timeframe$longitude, y=SSWD_timeframe$latitude,col=Species,), size = 2, alpha = 0.4) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-140, -130), ylim = c(55, 62), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude")
AK_zoom2_SSWD <- ggplot(data = world) +
geom_sf() +
geom_point(data= SSWD_timeframe,aes(x=SSWD_timeframe$longitude, y=SSWD_timeframe$latitude,col=Species,), size = 2, alpha = 0.4) +
annotation_scale(location = "bl", width_hint = 0.3) +
coord_sf(xlim = c(-136, -135), ylim = c(56.6, 57.4), expand = FALSE) +
xlab("Longitude") +
ylab("Latitude")
grid.arrange(AK_zoom1_SSWD, AK_zoom2_SSWD, ncol = 2)
## Scale on map varies by more than 10%, scale bar may be inaccurate
In most research papers, results and methods are separate. You can combine them here if you find it easier. You are also welcome to structure things such that those are separate sections.
As applicable, explain where and how you got the data. If you directly import the data from an online source, you can combine this section with the next.
Write code that reads in the file and cleans it so it’s ready for analysis. Since this will be fairly long code for most datasets, it might be a good idea to have it in one or several R scripts. If that is the case, explain here briefly what each file does. The files themselves should be commented well so everyone can follow along.
Use a combination of text/tables/figures to explore and describe your data. You should produce plots or tables or other summary quantities for most of your variables. You definitely need to do it for the important variables, i.e. if you have main exposure or outcome variables, those need to be explored. Depending on the total number of variables in your dataset, explore all or some of the others.
Create plots or tables and compute simple statistics (e.g. t-tests, simple regression model with 1 predictor, etc.) to look for associations between your outcome(s) and each individual predictor variable
Use one or several suitable statistical/machine learning methods to analyze your data and to produce meaningful figures, tables, etc. This might again be code that is best placed in one or several separate R scripts that need to be well documented. You can then load the results produced by this code
Summarize what you did, what you found and what it means.
Discuss what you perceive as strengths and limitations of your analysis.
What are the main take-home messages?
Include citations in your Rmd file using bibtex, the list of references will automatically be placed at the end